Skip to content

Conversation

@zhuqi-lucas
Copy link
Contributor

@zhuqi-lucas zhuqi-lucas commented Dec 3, 2025

Which issue does this PR close?

Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and row groups and return Inexact, now support reverse order case, and we don't need to cache anything for this implementation, so it's no memory overhead.

It will have huge performance improvement with dynamic topk pushdown to skip row groups.

Details:

Performance results on ClickBench sorted data: 13ms vs 300ms baseline (23x faster), close to aggressive caching approach (9.8ms) but with much better memory stability details

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

…. Only re-arrange files and row groups and return Inexact.
@github-actions github-actions bot added documentation Improvements or additions to documentation optimizer Optimizer rules core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) common Related to common crate execution Related to the execution crate proto Related to proto crate datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Dec 3, 2025
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review December 3, 2025 09:53
Copilot AI review requested due to automatic review settings December 3, 2025 09:53
@zhuqi-lucas zhuqi-lucas changed the title Draft Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and row groups and return Inexact. Establish the high level API for sort pushdown and the optimizer rule and support reverse files and row groups Dec 3, 2025
Copilot finished reviewing on behalf of zhuqi-lucas December 3, 2025 09:55
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR establishes the high-level API for sort pushdown optimization in DataFusion, specifically targeting Parquet files. The optimization improves TopK query performance by reordering files and row groups based on statistics, though it returns inexact ordering (keeping the Sort operator for correctness while enabling early termination).

Key Changes:

  • Added enable_sort_pushdown configuration option (default: true) to control the optimization
  • Implemented the optimizer rule PushdownSort that detects SortExec nodes and pushes sort requirements down to data sources
  • Extended ExecutionPlan, DataSource, and FileSource traits with try_pushdown_sort() methods to support the optimization
  • Added reverse_scan_inexact flag to ParquetSource to reverse row group read order when beneficial

Reviewed changes

Copilot reviewed 25 out of 30 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
docs/source/user-guide/configs.md Added documentation for the new enable_sort_pushdown configuration option
datafusion/sqllogictest/test_files/topk.slt Updated test expectation to show reverse_scan_inexact=true in explain output
datafusion/sqllogictest/test_files/spark/bitwise/bit_count.slt Fixed line number formatting in test file
datafusion/sqllogictest/test_files/slt_features.slt Fixed line number formatting in test file
datafusion/sqllogictest/test_files/information_schema.slt Added enable_sort_pushdown to configuration schema output
datafusion/sqllogictest/test_files/explain.slt Updated test expectations to show PushdownSort optimizer phase
datafusion/sqllogictest/test_files/dynamic_filter_pushdown_config.slt Added comprehensive integration tests for sort pushdown with various scenarios
datafusion/sqllogictest/test_files/create_external_table.slt Updated test expectation to show reverse_scan_inexact=true
datafusion/proto/src/logical_plan/file_formats.rs Added enable_sort_pushdown field to protobuf serialization (with hardcoded value bug)
datafusion/proto/src/generated/datafusion_proto_common.rs Added enable_sort_pushdown field to generated protobuf structures
datafusion/proto-common/src/to_proto/mod.rs Implemented protobuf serialization for enable_sort_pushdown (with hardcoded value bug)
datafusion/proto-common/src/generated/prost.rs Added enable_sort_pushdown field to generated protobuf structures
datafusion/proto-common/src/generated/pbjson.rs Added JSON serialization support for enable_sort_pushdown
datafusion/proto-common/src/from_proto/mod.rs Implemented protobuf deserialization for enable_sort_pushdown (with hardcoded value bug)
datafusion/proto-common/proto/datafusion_common.proto Added enable_sort_pushdown field to protobuf schema
datafusion/physical-plan/src/execution_plan.rs Added try_pushdown_sort() trait method to ExecutionPlan
datafusion/physical-optimizer/src/pushdown_sort.rs Implemented the core PushdownSort optimizer rule
datafusion/physical-optimizer/src/optimizer.rs Registered PushdownSort optimizer in the optimization pipeline
datafusion/physical-optimizer/src/lib.rs Exported the new pushdown_sort module
datafusion/execution/src/config.rs Added session config method (with naming and documentation issues)
datafusion/datasource/src/source.rs Added try_pushdown_sort() trait method to DataSource and DataSourceExec
datafusion/datasource/src/file_scan_config.rs Implemented sort pushdown logic for FileScanConfig including file reversal
datafusion/datasource/src/file.rs Added SortOrderPushdownResult enum and try_pushdown_sort() trait method to FileSource
datafusion/datasource-parquet/src/source.rs Implemented sort pushdown for ParquetSource with reverse_scan_inexact flag
datafusion/datasource-parquet/src/opener.rs Implemented row group reversal logic in ParquetOpener
datafusion/core/tests/physical_optimizer/test_utils.rs Added OptimizationTest harness for testing physical optimizers
datafusion/core/tests/physical_optimizer/pushdown_sort.rs Added comprehensive unit tests for the PushdownSort optimizer
datafusion/core/tests/physical_optimizer/mod.rs Registered the pushdown_sort test module
datafusion/common/src/file_options/parquet_writer.rs Added enable_sort_pushdown field to ParquetOptions destructuring
datafusion/common/src/config.rs Added enable_sort_pushdown configuration field (with indentation issue)

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate documentation Improvements or additions to documentation execution Related to the execution crate optimizer Optimizer rules physical-plan Changes to the physical-plan crate proto Related to proto crate sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Phase 1 Establish the high level API for sort pushdown and the optimizer rule. Only re-arrange files and row groups and return Inexact.

2 participants